Skip to content

Conversation

karasikov
Copy link
Member

@karasikov karasikov commented Oct 3, 2025

  • Support hosting multiple graphs
  • Filter which ones to query with field graphs in the request json
  • All query requests in the multi-graph mode share the same thread pool for querying individual graphs.

Usage:
./metagraph server_query indexes.txt --mmap --parallel 50

where indexes.txt is a csv file with graph names and paths:

human,human_chunk1.dbg,human_chunk1.row_diff_flat.annodbg
human,human_chunk2.dbg,human_chunk2.row_diff_flat.annodbg
microbe,microbe.dbg,microbe.row_diff_flat.annodbg
...

@karasikov karasikov requested a review from adamant-pwn October 10, 2025 20:23
@karasikov karasikov changed the title support hosting multiple graphs in one process multi-graph server: support hosting multiple graphs in one process Oct 10, 2025
abundance_sum: bool = False,
query_counts: bool = False,
query_coords: bool = False,
graphs: Union[None, List[str]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a bit confusing compared to self.graphs in this class. Are we sure we want to keep this naming? Maybe rename the parameter graphs in API to labels, or similar?

That aside, how are MultiGraphClients actually used? Do we theoretically expect a situation in which e.g. first graph serves label A while the second graph serves labels B and C, and we want to query just label A or just labels A and B?

Judging by the unit test below, this would fail. Is it the outcome we want, rather than e.g. returning empty output (and possibly some kind of warning in logs) when querying a label that we don't know?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, true. We can try to come up with a different name. Otherwise, MultiGraphClient is essentially a GraphClient with a ThreadPool, and it's not used all that migh right now. Maybe it's simple enough to remove. Anyone can always add a ThreadPool on top any time

assert(json.size() == result.size());
for (Json::ArrayIndex i = 0; i < result.size(); ++i) {
for (auto&& value : json[i]["results"]) {
result[i]["results"].append(std::move(value));
Copy link
Contributor

@adamant-pwn adamant-pwn Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put some limitations on final result size, and e.g. throw an exception if we detect that it became too large to handle? Also, might be nice to collect some statistics on how much time we spend querying graphs vs merging results, as the latter is effectively single-threaded due to locking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just does a push back to a vector. It should be extremely fast. Space-wise also. I think there is no need to worry about that right now.

Copy link
Contributor

@adamant-pwn adamant-pwn Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushing to vector should be fast, yes, I'm just concerned about the growth rate of the output. if a query is something that occurs everywhere, and it produces, say, 1 MiB of output on one chunk, it would elevate to ~2 GiB over 2k+ chunks, and then we also send it back over HTTP, right? Do we have something in place to guarantee scenarios like this don't happen? Or are we fine with them?

Speaking of that,

  • Is it guaranteed that the order of elements (over which Json::ArrayIndex go) is the same in all chunk results, and not possibly shuffled due to multithreading? I remember it being shuffled in AWS tests when output was in tsv format, so I had to additionally sort before merging.
  • We apply config.num_top_labels to all chunks individually, right? But should we maybe also apply it during merging, repeatedly discarding matches outside of num_top_labels? Given that we try to simulate a single combined graph on all chunks. This would also ensure result size stays on the order of magnitude of one chunk when this option is provided.
  • AFAIK, when we use num_top_labels, individual chunks return them in sorted order by the number of matches. I assume we might also need to re-sort them at the end after merging here, to preserve this format? Or is it ensured by JSON already?

I also don't know if there are other query parameters that we may need to account for during the merging stage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great point, I need to add the num_top_labels filter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit trickier than I thought because of the many different result types (counts/coords/presence bitmap/etc.).
Let's do this in another PR, so we don't delay merging this.

I'm thinking of two things:

  1. Always return sorted results in the server mode
  2. Filter by top_labels

@karasikov karasikov requested a review from adamant-pwn October 11, 2025 21:29
@karasikov karasikov merged commit cbb3891 into master Oct 14, 2025
17 checks passed
@karasikov karasikov deleted the mk/api branch October 14, 2025 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants